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Abstract 

A natural way to characterize the cluster structure of a dataset is by finding regions containing a high 
density of data. This can be done in a nonparametric way with a kernel density estimate, whose modes 
and hence clusters can be found using mean-shift algorithms. We describe the theory and practice behind 
clustering based on kernel density estimates and mean-shift algorithms. We discuss the blurring and non¬ 
blurring versions of mean-shift; theoretical results about mean-shift algorithms and Gaussian mixtures; 
relations with scale-space theory, spectral clustering and other algorithms; extensions to tracking, to 
manifold and graph data, and to manifold denoising; A-modes and Laplacian A-modes algorithms; 
acceleration strategies for large datasets; and applications to image segmentation, manifold denoising 
and multivalued regression. 


1 Introduction 

One intuitive way of defining clusters is to assume that the data points are a sample of a probability density 
function, and then to define the clusters through this density. For example, fig. 1 shows a 2D dataset and 
a density estimate for it, whose contours clearly suggest that there are two clusters of a complex shape. 
The first step, then, is to learn an estimate of the density for the data points. This can be done with a 
parametric model, such as a Gaussian mixture, typically trained with an EM algorithm to maximize the 
likelihood [55]. Such an approach is often computationally efficient and can give good results with clusters 
of elliptical shape, but it has several disadvantages. The likelihood function will typically have local optima, 
and finding a global optimum is, in general, very difficult; thus, the result is dependent on the initialization, 
and in practice a user will try different initializations (usually random restarts). The selection of the model 
(what kernel and how many components) is left to the user, as well as the number of clusters to find. And 
when the clusters have complex shapes, as for example in image segmentation, many components will be 
required to approximate them well, increasing the training time and the number of local optima. 

We focus on nonparametric, kernel density estimates (KDE). A KDE is a generalization of histograms to 
define density estimates in any dimension that are smooth. They simplify the mathematical and computa¬ 
tional treatment of densities and, crucially, enable one to use continuous optimization to find maxima of the 
density. With a kernel such as the Gaussian kernel, a KDE requires a single user parameter, the bandwidth 
(also referred to as scale). Given the bandwidth, the KDE is uniquely determined and, as seen below, so 
will be its clusters, which can take complex nonconvex shapes. Hence, the user need not select the number 
of clusters or try random restarts. We will focus on clusters defined by the modes of the KDE (although this 
is not the only way to define the clusters). A mode is a local maximum of the density. A natural algorithm 
to find modes of a KDE is the mean-shift iteration, essentially a local average, described in section 2. The 
basic idea in mean-shift clustering is to run a mean-shift iteration initialized at every data point and then 
to have each mode define one cluster, with all the points that converged to the same mode belonging to the 
same cluster. Section 2.3 reviews theoretical results regarding the number and location of modes of a KDE, 

*To appear in: CRC Handbook of Cluster Analysis, edited by Roberto Rocci, Fionn Murtagh, Marina Meila and Christian 
Hennig. 
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Figure 1: Illustration in 2D of complex-shaped clusters, the kernel density estimate (KDE) and the mean- 
shift results. A; the dataset. The circle has a radius equal to the bandwidth a used in the KDE. B: a contour 
plot of the Gaussian KDE p(x) with bandwidth a. The KDE has two modes, located at the center of the 
blue ellipses. Each ellipse indicates the eigenvectors (rescaled to improve visibility) of the Jacobian J(x*) of 
eq. (5) at a mode x*. The dotted-line polygon is the convex hull of the data points. The paths followed by 
Gaussian MS for various starting points are shown. C: the resulting clustering using mean-shift, with each 
point color-coded according to the cluster it belongs to. D: the resulting clustering and KDE contours using 
a smaller bandwidth (indicated by the radius of the black circle). Now, the KDE defines three clusters. 


the convergence of mean-shift algorithms and the character of the cluster domains. Section 2.4 discusses re¬ 
lations of mean-shift algorithms with spectral clustering and other algorithms. Sections 2.5 and 2.6 describe 
extensions of mean-shift for clustering and manifold denoising, respectively. One disadvantage of mean-shift 
algorithms is their computational cost, and section 2.7 describes several accelerations. Section 3 describes 
another family of KDE-based clustering algorithms which are a hybrid of A-means and mean-shift, the 
A-modes and Laplacian A-modes algorithms, which find exactly A clusters and a mode in each, and work 
better with high-dimensional data. Section 4 shows applications in image segmentation, inverse problems, 
denoising, and other areas. We assume multivariate, continuous (i.e., not categorical) data throughout. 
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2 Problem formulation and mean-shift algorithms 


In mean-shift clustering, the input to the algorithm are the data points (multivariate, continuous feature 
vectors) and the bandwidth or scale. Call C the data points to be clustered. We define a kernel 

density estimate [83] 
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with bandwidth ct > 0 and kernel K{t), e.g. K{t) = for the Gaussian kernel or K{t) = 1 — t iit G [0,1) 
and 0 if t > 1 for the Epanechnikov kernel. Many of the results below carry over to kernels where each point 
has its own weight and its own bandwidth, which can be an isotropic, diagonal or full covariance matrix. 
To simplify the presentation we focus on the case where all points have the same, scalar bandwidth a (the 
isotropic, homoscedastic case) and the same weight ^ unless otherwise noted. This is the case found most 
commonly in practice. Also, we mostly focus on Gaussian kernels, which are easier to analyze and give rise 
to simpler formulas. 

We can derive a simple iterative scheme = f(x*^'^)) for r = 0,1, 2 ... to find a mode of p by equating 

its gradient to zero and rearranging terms (section 2.7 discusses other ways to find modes). We obtain 


f(x) = 


N 


K'l 




( 2 ) 


where K' = dK/dt and the vector f (x) — x is the mean shift, since it averages the individual shifts x„ — x 
with weights as above. For a Gaussian kernel, K' oe K and this simplifies to the following, elegant form 
(where, by Bayes’ theorem, p(n|x) = p(x|n)p(n)/p(x) is the posterior probability of the component centered 
at x„ given point x) [9]: 


p(n|x) 


exp(-g||(x-x„)/crf) 

E^'=iexp(-5 ||(x-x„0/o-f) 


N 

n—1 


(3) 


As discussed below, under mild conditions this scheme converges to modes of p from nearly any initial 
Intuitively, each step moves the iterate to a local average of the data, in that data points closer to 
have larger weight, and this increases the density. Eq. 2 is called the mean-shift iteration, and it can be 
used in two distinct ways: to find modes, and to filter (or smooth) a dataset. This gives rise to two different 
clustering algorithms, as follows. We will refer to them as mean shift (MS) (where modes are found) and 
blurring mean shift (BMS) (where the dataset is filtered). 


2.1 Two basic types of mean-shift algorithms: MS and BMS 

Clustering by mean-shift (MS): find modes Here, we declare each mode of p as representative of one 
cluster, and assign data point x„ to the mode it converges to under the mean-shift iteration, f°°(x„). The 
algorithm is given in fig. 2A for the Gaussian kernel. We can also estimate error bars for each mode from 
the local Hessian [8, 10], given in eq. (5), which is related to the local covariance. 

Some practical problems need to be solved. Firstly, some points (minima and saddle points) do not 
converge to modes. It is unlikely that this will happen with a finite sample, but if so such points can be 
detected by examining the Hessian or by a postprocessing step that checks for small clusters. 

Second, the mean-shift iteration is stopped after a finite number of steps, for example when the relative 
change in the value of x is smaller than a set tolerance tol > 0. This means that data points that in theory 
would converge to the same mode actually stop at numerically slightly different points. A postprocessing 
step is necessary to merge these into a unique mode. This can be done by finding the connected components^ 
of a graph that has N vertices, one for every convergence point, and has an edge between any pair of vertices 
lying within a small distance e > 0. The graph need not be explicitly constructed. The user should set tol 
small enough to converge to the modes with good accuracy, while limiting the computational cost incurred; 
and e should be set quite larger than tol, but smaller than the distance between different true modes. 

^The connected-components algorithm is described in appendix A. 
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A. Gaussian mean-shift (MS) algorithm 

B. Gaussian blurring mean-shift (BMS) algorithm 

for n G {1,... , N} 

X ^ x„ 
repeat 

W ^ 1 ^ , exp (-i||(x-x„)/<T||^) 

Vn: pin x) <- - -^ 

E^=iP(^|x)x„ 

until stop 

Z„ ^ X 

end 

connected-components ({z„}()L]^,e) 


repeat 

for TO G {!,..., N} 

Vn: p(n|x) .- exp (-^||(x,„-x„)/.||-) 

E;y, = iexp(-i||(x™-x„,)/<Tf) 

ym ^ I]„=lP(^^|Xm)x„ 

end 

Vto: x„ ^ Ym 
until stop 

connected-components ( {x„ } , e) 


C. Gaussian MS algorithm in matrix form 

D. Gaussian BMS algorithm in matrix form 

Z = X 
repeat 

W = (exp (-i \\{Zm - X„)/CTf 

D = diag Wnm) 

Q = WD-1 

Z = XQ 
until stop 

connected-components ({z„}))L]^,e) 


repeat 

W = (exp (-i II (x™ - x„)/crf 

D = diag (j2n=l 

P = WD-1 

X = XP 
until stop 

connected-components ( {x„ } , e) 



Figure 2: Pseudocode for MS (left) and BMS (right), in both loop and matrix forms, for the Gaussian kernel. 
In all cases, the input is a dataset xi,..., xn G and a bandwidth cr > 0. See section 2.1 for the stopping 
criterion and the connected-components threshold distance between points e. 


Clustering by blurring mean-shift (BMS): smooth the data Here, each point of the dataset 
actually moves to the point f(xm) given by eq. (2). That is, given the dataset X = {xi,... ,xn}, for each 
Xm G X we obtain a new point x™ by applying one step of the mean-shift algorithm: x^ = f(xm). Thus, 
one iteration of blurring mean-shift results in a new dataset X which is a blurred (smoothed) version of X. 
By iterating this process we obtain a sequence of datasets X(°\ X(^\ ... (and a sequence of kernel density 
estimates p*^°^(x),p*^^^(x),...) where is the original dataset and is obtained by blurring 
with one mean-shift step (see fig. 6). 

As will be shown below, Gaussian BMS can be seen as an iterated filtering (in the signal processing 
sense) that eventually leads to a dataset with all points coincident for any starting dataset and bandwidth. 
However, before that happens, the dataset quickly collapses into meaningful, tight clusters which depend on 
a (see fig. 6), and then these point-like clusters continue to move towards each other relatively slowly. A 
stopping criterion that detects this situation quite reliably is based on whether the entropy of the dataset 
changes [13] (a simpler criterion would be to stop when the update to X is small, but this does not always 
give good clusters). As with MS clustering, a connected-components postprocessing step merges the points 
into actual clusters. The BMS algorithm is given in fig. 2B. 

Similarities and differences between MS and BMS Although both MS and BMS are based on 
the same mean-shift iteration, they are different algorithms and can produce different clustering results. 
Specifically, given a value of the bandwidth, the number of clusters resulting from MS and BMS is usually 
different. However, the collection of clusterings produced over a range of bandwidths can be quite similar. 

BMS is quite faster than MS in number of iterations and in runtime, particularly if using the accelerated 
BMS algorithm (section 2.7), which introduces essentially no approximation error. However, MS (and also 
BMS) can be considerably accelerated if a small clustering error is tolerated (section 2.7). 

In MS, the N optimizations (one for each data point) proceed independently, but they could be done 
synchronously, as in BMS, without altering the result. However, practically this is wasteful, because the 
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number of iterations required varies considerably among points, and a synchronous scheme would have to 
run the largest number of iterations. Conversely, it is possible to run BMS with asynchronous iterations, for 
example moving points as soon as their update is computed. However, this makes the result dependent on 
the order in which points are picked, and is unlikely to be faster than the accelerated algorithm described 
below. 


Choice of bandwidth The fundamental parameter in mean-shift algorithms is the bandwidth a, which 
determines the number of clusters. The statistics literature has developed various ways to estimate the band¬ 
width of a KDE [77, 83], mostly in the ID setting, for example based on minimizing a suitable loss function 
(such as the mean integrated squared error), or more heuristic rules (such as making the bandwidth propor¬ 
tional to the average distance of each point to its fcth nearest neighbor). While these bandwidth estimators 
are useful and can give reasonable results, they should be used with caution, because the bandwidth that 
gives the best density estimate (in a certain sense) need not give the best clustering—clustering and density 
estimation are, after all, different problems. Besides, clustering is by nature exploratory, and it is best to 
explore a range of bandwidths. Computationally, this is particularly easy to do in MS (see the scale-space 
discussion below). 

It is also possible to use a different bandwidth (t„ for each point (called adaptive KDE), which can 
help with areas where points are sparse, for example. A good way to do this is with entropic affinities 
[45, 82], where the user sets a global number of neighbors k and then, for each data point n = 1,... ,N, 
the bandwidth (t„ is computed so that point n has a distribution over neighbors with a perplexity (log- 
entropy) k, i.e., each point sets its own bandwidth to have k effective neighbors. One could then vary k to 
achieve different clusterings. Other ways to construct adaptive KDEs for MS have been proposed [28]. The 
mean-shift update with adaptive bandwidths ci,..., ctat has the form [14]: 


qin\x) 


p{n\x)affi 


N 

n—1 


(4) 


where the q(m\x.) values are the posterior probabilities p(m|x) reweighted by the inverse variance and renor¬ 
malized (compare with the single-bandwidth equation (3)). 

Mean-shift has also been used to track moving objects (“blobs”) through a sequence of images [30]. Since 
the blobs can change size, using a fixed bandwidth (scale) becomes problematic. Collins [27] used ideas from 
scale-space theory (see below) to select a scale that adapts to the blob size. An “image feature” is defined 
as a point in scale-space where a certain differential operator achieves a local maximum (i.e., a mode) with 
respect to both space and scale. This mode is then tracked by a two-step mean-shift procedure that convolves 
in image space with a filterbank of spatial difference-of-Gaussian filters, and convolves in scale space with 
an Epanechnikov kernel, in an efficient way. 


Choice of kernel Different kernels give rise to different versions of the mean-shift and blurring mean- 
shift algorithms. Much previous work (including [24, 37]) uses the Epanechnikov kernel for computational 
efficiency, since the kernel evaluations involve only pairs of neighboring points (at distance < a) rather than 
all pairs of points (though the neighbors must still be found at each iteration), and convergence occurs 
in a finite number of iterations. However, in practice, the Gaussian kernel produces better results than 
the Epanechnikov kernel [29], which generates KDEs that are only piecewise differentiable and can contain 
spurious modes (see below). 

Hierarchical mean-shift and scale-space clustering The behavior of modes and other critical points 
(minima, saddles) as a function of the bandwidth (or scale) is the basis of scale-space theory in computer 
vision [50, 54, 91]. Here, one studies the evolution of an image under Gaussian convolution (blurring). If 
we represent the image as a sum of delta functions centred at the pixel locations and with value equal to 
the grayscale, convolving this with an isotropic Gaussian kernel of scale a gives a KDE (with weighted 
components). As the scale increases, the image blurs, structures in the image such as edges or objects lose 
detail, and for large scales the image tends to uniformly gray. Some important structures, such as objects, 
can be associated with modes of the KDE, and the lifetime of a mode—defined by the scale interval between 
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the creation and destruction (merging with another mode) of that mode—is taken as an indication of its 
importance in the image. 

An ideal convolution kernel should blur structure in the image but never create new structure (which 
would then reflect properties of the kernel rather than of the image). The result would then be a tree of 
modes, where there are N modes for cr = 0 (one mode at each component) and modes merge as a increases 
until there is a single mode. Unfortunately, this is only true for the Gaussian kernel and only in dimension one 
(section 2.3). In dimension two, i.e., with images, new modes can be created as the scale decreases. However, 
in practice these creation events seem rare and short-lived, in that the mode created usually merges with 
another mode or critical point at a slightly larger scale. Thus, they are generally ignored. Computationally, 
one starts with a mode at each pixel (or data point) for <7 = 0 and then tracks the location of the modes as 
in a numerical continuation method [61], by running mean-shift at a new, larger scale using as initial points 
the modes at the previous scale. By construction, this results in a tree—although, unlike in agglomerative 
clustering, the resulting set of clusterings at each scale need not be nested. 

Another notion of lifetime, topological persistence, has been explored in computational geometry [34], 
and used to define a hierarchical mean-shift image segmentation [23, 62]. The tree of modes has also been 
proposed in the statistical literature as a tool for visualization of KDEs [59], but this is practical only for 
small datasets in ID or 2D. The mode tree is sensitive to small changes in the data and gives no way to 
differentiate between important modes and those caused by, for example, outliers, so it can help to combine 
several trees constructed by jittering or resampling the original dataset [60]. 

Matrix formulation of BMS and generalizations of BMS As indicated in fig. 2D, we can equivalently 
write each Gaussian BMS iteration in matrix form [13] as X ^ XP in terms of the random-walk matrix 
P = WD 1, an N X N stochastic matrix with elements Pnm = p{n\xm) G (0,1) and '^n=iPnm = 1- Here, 
X = (xi, ... ,XAr) is the D x N matrix of data points, W = (exp (—i ||(xm — x„)/ct|| ))nm is a Gaussian 
affinity matrix, which defines a weighted graph where each x„ is a vertex, and D = diag (X]!^=i Wnm) is the 
degree matrix of that graph. This establishes a connection with spectral clustering, which we describe in 
section 2.4. 

Also, we can regard the iteration X ^ XP as a smoothing, or more generally a filtering [15, 33, 79, 80], 
of (each dimension of) the data X with an inhomogeneous filter P, where the filter depends on the data 
and is updated at each iteration as well, hence resulting in a nonlinear filtering. This in turn suggests that 
one could use other filters constructed as a function <()(P), where (/> is a scalar function, so it modifies the 
spectrum of P. Carreira-Perpinan [15] studied several such filters, including explicit, implicit, power and 
exponential ones, depending on a step size parameter 77 , resulting in generalized Gaussian blurring mean- 
shift algorithms. He gave convergence conditions on </> and p and found that the different filters tend to find 
similar clusters, i.e., over a range of bandwidths they can obtain the same clustering (at possibly different 
a values). However, their runtime varies widely. Implicit filters (which involve solving a linear system) or 
power filters (which involve iterating a matrix product) have a strong clustering effect in each iteration, 
but their iterations themselves are more costly. When one considers both the number of iterations and 
the cost of each iteration, the method found fastest was a slightly overrelaxed explicit function of the form 
(/)(P) = (1 — 77)1 -I- r/P with 77 « 1.25. However, its runtime was very close to that of the standard BMS 
(77 = 1). An interesting extension of this work would be to be able to design the function (p so that the 
resulting generalized BMS algorithm is optimal (in some sense) for clustering. 

It is also possible to write MS in matrix form (fig. 2C), where we write the random-walk matrix with 
the symbol Q to differentiate it from the standard random-walk matrix P, since Qnm = p(x„|zm) is defined 
on two sets of points (X and Z) while P is defined on one (X = Z). However, the matrix form implies that 
the MS updates for all points are synchronous, which, as mentioned before, is slow. 

Mapping new points to clusters (out-of-sample mapping) Having run MS or BMS on a dataset, 
how should we deal with new data points not in the original dataset? The purist option (and the only one 
for BMS) is to run the clustering algorithm from scratch on the entire dataset (old and new points), but this 
is computationally costly. A faster option with MS is to use the original KDE and simply run MS on each 
of the new points, assigning them to the mode they converge to. This reflects the fact that MS clusters not 
just the points in the dataset, but (implicitly) the whole space. However, this point of view implies no new 
clusters are created when new points arrive. 
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Advantages and disadvantages of mean-shift algorithms The advantages of mean-shift algorithms 
stem from the nonparametric nature of the KDE: (1) It makes no model assumptions (other than using a 
specific kernel), unlike Gaussian mixture models or A-means, for example. (2) It is able to model complex 
clusters having nonconvex shape (although this does not imply that all shapes can be modeled well), unlike 
A-means. (3) The user need only set one parameter, the bandwidth, which has an intuitive physical meaning 
of local scale, and this determines automatically the number of clusters. This is often more convenient than 
having to select the number of clusters explicitly. (4) It has no local minima, thus the clustering it defines is 
uniquely determined by the bandwidth, without the need to run the algorithm with different initializations. 
(5) Outliers, which can be very problematic for Gaussian mixtures and A-means, do not overly affect the 
KDE (other than creating singleton clusters). 

Using KDEs and equating modes with clusters as in mean-shift also has some disadvantages. The most 
important one is that KDEs break down in high dimensions, where the number of clusters changes abruptly 
from one for large a to many, with only a minute decrease in a. Indeed, most successful applications of 
mean-shift have been in low-dimensional problems, in particular image segmentation (using a few features 
per pixel, such as color in LAB space). A way of using modes in high-dimensional spaces is the A-modes 
algorithm described in section 3. 

Other possible disadvantages, depending on the case, are as follows. (1) In some applications (e.g. 
figure-ground or medical image segmentation) the user may seek a specific number of clusters. However, 
in mean-shift clustering we have no direct control over the number of clusters: to obtain A clusters, one 
has to search over cr. This is computationally costly (and sometimes not well defined, since the number 
of clusters might not be a monotonic function of cr). (2) We do not differentiate between meaningful and 
non-meaningful modes. For example, outliers will typically create their own mode; or, the density in a 
cluster may genuinely contain multiple modes (especially with clusters that have a nonconvex or manifold 
structure, as in fig. ID). Some of these problems may be partially corrected by postprocessing the results 
from mean-shift (e.g. to remove low-density modes and clusters with few points, which are likely outliers), or 
by the A-modes algorithm. Finally, mean-shift is slow computationally, and this is addressed in section 2.7. 

2.2 Origins of the mean-shift algorithm 

The mean-shift algorithm is so simple that it has probably been discovered many times. In 1975, Fukunaga 
and Hostetler [37] were perhaps the first to propose its idea and also introduced the term “mean shift”. They 
derived the blurring version of the algorithm (BMS) for a KDE with the Epanechnikov kernel as gradient 
ascent on logp(x) with a variable step size, without proving convergence or giving a stopping criterion. 
They observed that it could be used for clustering and dimensionality reduction (or denoising), since points 
converge locally to cluster centroids or medial lines for appropriate values of the bandwidth. Since 1981, 
the algorithm was also independently known in the statistics literature as “mean update algorithm” (see 
[81] pp. 167ff and references therein). The term “blurring process” is due to Gheng [24], who discussed the 
convergence of both the blurring (BMS) and the non-blurring (MS) versions of mean-shift. Garreira-Perpinan 
[9], motivated by the problem of finding all the modes of a Gaussian mixture for multivalued regression [8, 10], 
independently rediscovered the algorithm for the Gaussian kernel and proved its convergence for arbitrary 
covariance matrices [10]. Since the early 2000s, the non-blurring mean-shift received much attention thanks 
to the work of Gomaniciu and Meer [29], who demonstrated its success in image filtering, image segmentation 
and later in tracking ]30]. This was followed by work by many researchers in both theoretical, computational 
and application issues. Algorithms similar to mean-shift have appeared in scale-space clustering [21, 68, 90, 
92], in clustering by deterministic annealing [69] and in pre-image finding in kernel-based methods [72]. 

2.3 Theoretical results about mean-shift algorithms and Gaussian mixtures 

Although MS and BMS are dehned by very simple iterative schemes, they exhibit some remarkable and 
somewhat counterintuitive properties regarding whether they converge at all, how fast they converge, and 
the character of the convergence domains. The geometry of the modes of a KDE is also surprising. The 
relevant literature is scattered over different areas, including computer vision, statistics and machine learning. 
We give a summary of results, with particular attention to Gaussian kernels, without proof. More details 
can be found in the references cited. 
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Geometry of the modes of a Gaussian mixture Intuitively, one might expect that a sum of N 
unimodal kernels with bandwidth a as in eq. (1) would have at most N modes, and that the number of 
modes should decrease monotonically as a increases from zero and the different components coalesce. In 
general, this is only true for the Gaussian kernel in dimension one. Motivated by scale-space theory, several 
papers [3, 50, 95] showed that, in dimension one, the Gaussian kernel is the only kernel that does not create 
modes as the scale increases. It is easy to see that, in KDEs with non-gaussian kernels (Epanechnikov, 
Gauchy, etc.), modes can appear as a increases [18]. The creation of modes need not occur often, though, 
and some kernels (such as the Epanechnikov kernel) are more likely than others to create modes. It is less 
easy to see, but nonetheless true, that modes can also appear with the Gaussian kernel in dimension two, 
i.e., with images, and thus in any larger dimension, as shown by an example in [53]. 

The scale-space theory results were restricted to a single bandwidth a, and also the creation of modes 
does not necessarily imply that a mixture with N components may have more than N modes. The results for 
the general case of Gaussian mixtures (GMs) are as follows. Again, there is a qualitative difference between 
ID and 2D or more. A GM (with possibly different bandwidths (j„) can have at most N modes in ID [19] 
but more than N modes in 2D or above, even if the components are isotropic [20]. In 2D, if the components 
have diagonal or full covariance matrices, it is easy to construct examples with more modes than components 
[10, 19]. It is far harder to achieve this if all the components are isotropic with equal bandwidth ct, but still 
possible. This was first shown by a construction suggested by Duistermaat and studied by Garreira-Perpihan 
and Williams [20], consisting of 3 Gaussians in the vertices of an equilateral triangle. For a narrow interval 
of scales (lifetime), an extremely shallow fourth mode appears at the triangle barycenter. They generalized 
this construction to dimension D as a regular simplex with a Gaussian in each vertex, i.e., a GM with D -\-l 
components, one at each vertex of the simplex [20]. They showed that the number of modes is either D + 1 
(for small scale, modes near the vertices), 1 (for large scale, at the simplex barycenter) or D-l-2 (for a narrow 
range of intermediate scales, modes at the barycenter and near the vertices). Interestingly, the lifetime of 
the mode that is created in the barycenter peaks at D = 698 and then slowly decreases towards zero as D 
increases. Small perturbations (but not vanishingly small) of the construction prevent the creation of the 
extra mode, which suggests that it may be unlikely for isotropic GMs to have more modes than components. 
However, apart from a few isolated studies, the geometry of the modes of GMs in high dimensions is poorly 
understood. 

As for the location of the modes of a GM, they are in the interior of the convex hull of the data points 
if the components are isotropic (with possibly different bandwidths (t„), for any dimension [10, 19], as seen 
in fig. IB. With KDEs, it is easy to see this from the fact that eq. (2) is a convex sum if K' > 0. If the 
components have diagonal or full covariance matrices, the modes can be outside the convex hull [10, 18, 19]. 

Most work on mean shift and scale space theory tacitly assumes that the modes of a GM are finite in 
number or at least are isolated. Although sometimes this is claimed to be a consequence of Morse theory, 
no proof seems to exist for an arbitrary dimension^. Indeed, kernels such as the uniform or triangular kernel 
do create continuous ridges of modes even in ID. 

Convergence of MS An attractive feature of MS is that it is defined without step sizes, which makes 
it deterministic given the dataset and bandwidth. However, this is only useful if it converges at all, and 
whether this is the case depends on the kernel used. Not all kernels K give rise to convergent MS updates, 
even if they are valid kernels (i.e., they are nonnegative and integrate to one); an example where MS diverges 
appears in [29]. For kernels where MS diverges, it is of course possible to devise optimization algorithms that 
will converge to the modes, e.g. by introducing a line search, but we lose the simplicity of the MS iteration. 
For kernels where MS converges, the convergence is to a mode for most initial points, but in general to a 
stationary point (minima and saddle points in addition to modes). Here, we review convergence results for 
MS, with a focus on the most common kernels (Gaussian and Epanechnikov). 

With the Epanechnikov kernel, convergence occurs in a finite number of iterations [29]. The intuitive 
reason for this is that the KDE, which is the sum of N kernels, is piecewise quadratic. There is a finite 

^In dimension one, this is simple to prove assuming infinite differentiabilty (as for the Gaussian kernel), by setting / to the 
gradient of the mixture in the following result. Let f(x) be an infinitely differentiable real function of x G K. Then either 
/ is identically zero or its zero crossings are isolated. Indeed, if / is zero in a nonempty interval {a,b), then Va; € (n, 6) we 
have lim/i_>o {f{x + h) — f(x))/h = 0 = fix). Repeating this argument for /', /", etc. we obtain that all derivatives at any 
X G (a, b) are zero, hence by Taylor’s theorem f(x) = 0 Va; G K. 



number of “pieces” (each corresponding to a possible subset of the N kernels), and, within one piece, a single 
Newton iteration would find the maximum (indeed, the MS update coincides with a Newton step for the 
Epanechnikov kernel [35]). 

With the Gaussian kernel, MS also converges, but in an infinite number of iterations, and the convergence 
rate is generally linear (sublinear or superlinear in limit cases) [10, 14]. One convenient way to see this is by 
relating MS with the EM algorithm [56], as follows. 

In general for mixtures of Gaussians (using arbitrary covariance matrices) or other kernels (and thus in 
particular for KDEs), Gaussian MS is an expectation-maximization (EM) algorithm and non-gaussian MS is 
a generalized EM algorithm [10, 14]. This can be seen by defining a certain dataset and a certain probabilistic 
model with hidden variables (missing data), deriving the corresponding EM algorithm to maximize the log- 
likelihood for it, and verifying it coincides with the MS algorithm. For the GM case, the model consists 
of a constrained mixture of Gaussians, where each component has the given weight, mean and covariance 
matrix (constant, data point and bandwidth for KDEs, respectively), but the whole mixture can be freely 
(but rigidly) translated. The dataset consists solely of one point located at the origin. Thus, maxima of 
the likelihood occur whenever the translation vector is such that a mode of the mixture coincides with the 
origin. The missing data is the index of the mixture component that generated the origin. The resulting 
E step computes the posterior probabilities p(n|x) of eq. (3), and the M step maximizes a lower bound on 
the log-likelihood in closed form giving the MS update. With non-gaussian kernels, the M step cannot be 
solved in closed form, and the MS update corresponds to a single iteration to solve this M step. (Whether 
this iteration actually increases the likelihood, leading to convergence, depends on the kernel.) 

Viewing MS as an EM algorithm has several consequences. Convergence for Gaussians MS is assured by 
the EM convergence theorems [56] (which apply because the Q function in the EM algorithm is continuous), 
and each iterate increases or leaves unchanged the density. General results for EM algorithms also indi¬ 
cate that the convergence order will typically be linear (with a rate dependent on the ratio of information 
matrices). Finally, MS can be seen as a bound optimization (see also [35]). 

The convergence order can be studied in detail [14] by linearizing the MS update x ^ f(x) of eq. (3) 
around a mode x*, so the update can be written as — x* « J(x*)(x*^”) ~x*), where J(x*) is the D x D 

Jacobian of f . The Jacobian of f and the Hessian of p are related as follows: 


J(x*) = 4s(x*) V2p(x*) = ^(J(x*) - I) 


N 

S(x*) = ^ p(n|x*)(x„ - x*)(x„ - x*)'^ (5) 

n—1 


where S(x*) is the local covariance matrix at x*, since (from x* = f(x*) and the definition of f) the local 
mean is x* itself. The eigenvalues of J(x*) are in (0,1) and the convergence rate is given by the largest 
one, with the iterates approaching x* along the eigenvector associated with the largest eigenvalue (assuming 
distinct eigenvalues). This shows that the convergence order depends on the bandwidth: it is nearly always 
linear, approaches superlinear convergence when cr —>■ 0 or ct —>■ oo, and converges sublinearly at mode 
merges. The practically useful cases of MS use an intermediate a value, for which the rate r of linear 
convergence (i.e., the ratio of distances to the mode after and before the update) can be close to 1, thus 
convergence will be slow. The MS iterates smoothly approach the mode along the principal component of 
the local covariance matrix of the data points, from within the convex hull of the data points (see fig. IB). 


Other properties of MS We focus on Gaussian MS on a KDE, i.e., with isotropic covariances of band¬ 
width cr. As seen in fig. IB, if initialized at points far from a mode, the first MS steps are often large 
and make considerable progress towards the mode. This is an advatageous property generally observed in 
alternating optimization algorithms (such as EM). After that, the steps become small, in agreement with 
the linear convergence rate. 

The path followed by the MS iterates has the following properties (illustrated in fig. IB). (1) Near 
convergence, the path follows the direction of the principal eigenvector of the local covariance at the mode 
[14]. (2) Each iterate is a convex linear combination of the data points, so the path lies in the interior of 
the convex hull of the data points [9, 14]. This is also true of non-gaussian kernels if they satisfy K' > 0. 
(3) The path is smooth in the sense that consecutive steps (consecutive mean-shift vectors f(x) — x) always 
make an angle in (—f, f) [29]. (4) The mean shift vector f(x) — x is proportional to the gradient of p, so 
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the MS iteration is a gradient step, but the step size is not the best one (i.e., does not maximize p) along 
the gradient. 

In itT-means, the centroids partition the space into Voronoi cells, which are convex polytopes. In MS, it is 
harder to characterize the regions that the modes partition the space into, i.e., their domains of convergence, 
which can in fact have surprising properties [14]. In general, they can be curved, nonconvex and disconnected 
(for example, one domain can be completely surrounded by another). This allows MS to represent complex¬ 
shaped clusters and is an advantage over itl-means. A less desirable aspect is that the domain boundaries can 
show fractal behavior, although this seems confined to cluster boundaries, and could be removed if necessary 
by postprocessing the clusters. This fractal behavior is due to the iterated MS mapping f(x), and would not 
occur if we defined the clusters purely based on flow lines of the gradient. 

Convergence of BMS Cheng [24] proved convergence of blurring mean-shift, as follows. (I) For kernels 
broad enough to cover the dataset X (e.g. infinite-support kernels such as the Gaussian) convergence is to a 
dataset with all points coincident = ... = x^^), regardless of the value of a. This can be seen 

by noting that the diameter of the data set decreases at least geometrically. (2) For finite-support kernels 
and small enough ct, convergence is to several clusters with all points coincident in each of them; the clusters 
depend on the value of a. 

Another proof can be obtained from the matrix formulation of BMS [13], since at each iteration the 
dataset X is multiplied times a stochastic matrix P(X). By the Perron-Frobenius theorem [47, ch. 8], with 
broad kernels this will have a single eigenvalue equal to 1 and all other eigenvalues with magnitude less 
than 1. Since a fixed point verifies X = XP then X = xl^ for some x G i.e., all points coincide. For 
non-broad kernels, the unit eigenvalue is multiple, resulting in multiple clusters where all points coincide. 

While, for Gaussian BMS, the dataset converges to a size (therefore variance) zero, its variance need not 
decrease monotonically, in fact it is easy to construct examples in ID where it increases at some steps. 

Carreira-Perpihan [13] completely characterized the behavior with the Gaussian kernel (Gaussian BMS), 
assuming that the dataset is infinite with a Gaussian distribution. In this case, one can work with distri¬ 
butions rather than finite samples, and the iteration X •<— XP can be written as a Gaussian integral and 
solved in closed form. The result is that the data distribution p(x) is Gaussian after each iteration, with 
the same mean, and it shrinks towards its mean independently along each principal axis and converges to 
it with cubic order. Specifically, the standard deviation s along a given principal axis evolves as s ^ r(s)s 
with r(s) = G (0,1), where a is the BMS bandwidth, and s converges to 0 cubically. The reason 

for this extremely fast convergence is that, since a is kept constant but the dataset shrinks, effectively a 
increases. Thus, at each iteration both s and , , , .o decrease. Note that the smaller the initial s is, the 

faster the convergence and so the direction of largest variance (principal component) collapses much more 
slowly (in relative terms) than all other directions. 

This explains the practical behavior shown by Gaussian BMS (see fig. 6): (1) clusters collapse extremely 
fast (in a handful of iterations, for a suitable bandwidth); (2) after a few iterations only the local principal 
component survives, resulting in temporary linearly-shaped clusters (that quickly straighten). These two 
behaviors make BMS useful for clustering and denoising, respectively. 

The same proof technique applies to the generalized Gaussian BMS algorithms that use an update of the 
form X ^ X(;1'(P(X)). With a Gaussian dataset, each iteration produces a new Gaussian with a standard 
deviation (separately along each principal axis) s G- |^(r(s))| s. This allows a complete characterization of 
the conditions and order of convergence in terms of the real function 4>{r), r G (0,1), instead of a matrix 
function. Gonvergence occurs if |<()(r)| < 1 for r G (0,1) and 4>{1) = I. Depending on (/>, the convergence 
order can vary from linear to cubic and beyond [15]. 
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2.4 Relations with other algorithms 

Spectral clustering As noted earlier, putting Gaussian BMS in matrix form in terms of a Gaussian 
affinity matrix W uncovers an intimate relation with spectral clustering. Each BMS iteration is a product 
X ^ XP of the data times P = (p(n|xm))„m) the stochastic matrix of the random walk in a graph [25, 58], 
which in BMS represents the posterior probabilities of each point under the kernel density estimate (1). P 
is closely related to the matrix N = D~2 WD“2 (equivalent to the normalized graph Laplacian) commonly 
used in spectral clustering, e.g. in the normalized cut [76]. The eigenvalue/eigenvector pairs (^„,u„) and 
(An, v„) of P and N satisfy = A„ and u„ = D ~2 v„. In spectral clustering, given a and X one computes 
the eigenvectors associated with the top K eigenvalues of N (if K clusters are desired); in this spectral space 
the clustering structure of the data is considerably enhanced and so a simple algorithm such as Tf-means 
can often find the clusters. In BMS, we iterate the product X ^ XP. If P were kept constant, this would 
be the power method [39] and each column of X would converge to the leading left eigenvector of P (the 
vector of ones, i.e., a single cluster), with a rate of convergence given by the second eigenvalue < 1 
(the Fiedler eigenvalue in spectral clustering). However, the dynamics of BMS is more complex because P 
also changes after each iteration. In practice P and X quickly reach a quasistable state where points have 
collapsed in clusters which slowly approach each other and P remains almost constant (at which point BMS 
is stopped). Thus, BMS can be seen as refining the original affinities into a matrix consisting of blocks of 
(nearly) constant value and then (trivially) extracting piecewise-constant eigenvectors for each cluster with 
the power method. With the generalized BMS algorithm, one uses instead the matrix (('(P), which has the 
same eigenvectors vi,...,VAr as P but eigenvalues (()(Ai), ..., (()(AAr). However, this manipulation of the 
spectrum of P is performed implicitly, without actually having to compute the eigenvectors as in spectral 
clustering. 

While both spectral clustering and BMS rely on the random-walk matrix P, they differ in several respects. 
They do not give the same clustering results. In spectral clustering, the user sets the desired number of 
clusters K and the bandwidth u (if using Gaussian affinities), while in BMS the user sets only a and K is 
determined by this. Gomputationally, spectral clustering solves an eigenproblem (and then runs AT-means), 
while BMS (especially if using its accelerated version) performs a small number of matrix products (and 
then runs connected-components), thus it is considerably faster. 

Bilateral filtering and nonlinear diffusion Many image processing algorithms (for example, for image 
denoising) operate on range variables (intensity, color) defined on the spaee variables (of the image lattice). 
This includes bilateral filtering [63], nonlinear diffusion [89] and others. Mean-shift is basically an iterated, 
local averaging that operates jointly on range and space, and bears both similarities and differences with 
those algorithms, which are described in [4]. For example, in bilateral filtering the spatial component is fixed 
during iterations, so only the range variables are updated, and a stopping criterion is necessary to prevent 
excessive smoothing. 

Other mean-shift-like algorithms Mean-shift algorithms appear whenever we have expressions having 
the form of a sum over data points of a function of squared distances of the parameter x (whether the latter 
is a vector or a matrix or a set thereof). Equating the gradient of this expression to zero we can solve 
for X and obtain a hxed-point iteration with the form of a weighted average of the data points, where the 
weights depend on x. One example mentioned below are Riemannian centers of mass. Another example are 
Laplacian objective functions, of the form m=i 11^" ~ ||^ = 2 tr (XLX^), where X is of L x A^ are 

coordinates of the N data points in a low-dimensional space, W = (wnm) is an affinity matrix and L is its 
graph Laplacian. Alternating optimization over each x„ can be done with a mean-shift algorithm. Laplacian 
objectives appear in algorithms for (among other problems) dimensionality reduction and clustering, such 
as Laplacian eigenmaps [5], the elastic embedding [16] or spectral clustering [76]. However, as noted in 
section 2.3, the resulting mean-shift iteration need not converge in general. This is the case, for example, 
when the weights or the kernels can take negative values, as in the elastic embedding. In this case, one 
should use a line search or a different optimization algorithm altogether. 
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2.5 Extensions of mean-shift for clustering 

Tracking: mode finding over time In some applications, the distribution of the data changes over time, 
and we want to track clusters and their location, in order to predict their location at a future time (assuming 
the distribution changes slowly). One example is tracking a nonrigid object over a sequence of images, such 
as a video of a moving car. A robust way to represent the object is by the color histogram of all the pixels 
within the object. In its simplest form, one can use a region of fixed shape and size but variable location, 
initialized by the user in the first image. Then, the most likely location of the object in the next image can 
be defined as the region (near the current region) having the histogram closest to the current histogram. 
Comaniciu et al. [30] noted that we can use a differentiable KDE instead of a histogram, and that any 
differentiable similarity function of two KDEs also has the form (up to constant factors) of a weighted KDE 
to first order (a good approximation if the object moves slowly). Hence, we can use mean-shift iterations to 
maximize the (linearized) similarity over the location, thus finding the location where pixels look most like 
in the previous image’s region. The resulting algorithm is fast enough to track several objects in real-time 
in video, and is robust to partial occlusion, clutter, distractors and camera motion. 

Hall et al. [43] define a spatiotemporal KDE p(x, t) with a product kernel Arx(x — x„; (j^)Kt{t — at). 
The kernel over the spatial variables x is typically Gaussian and the kernel over the temporal variable t 
considers only past observations and gives more importance to recent ones. At each new value of t, one finds 
modes over x by starting from the modes of the previous time value (as in scale-space theory, but using time 
rather than scale). 

Manifold data The original mean shift algorithm is defined on the Euclidean space K.^. Sometimes, the 
data to be clustered lies on a low-dimensional manifold, so the mean-shift iterates and the modes should 
also lie on this manifold. In some applications (such as motion segmentation, diffusion tensor imaging 
and other computer vision problems) this is a known Riemannian manifold, such as rotation matrices, 
Grassmann manifolds or symmetric positive definite matrices [6]. Subbarao and Meer [78] extended mean- 
shift to Riemannian manifolds by defining the squared distances appropriately with respect to the manifold 
(although this does not result in a proper KDE on the manifold). However, most times the manifold is not 
known a priori. Shamir et al. [73] extended mean-shift to 3D point clouds in computer graphics applications, 
by constraining the mean shift steps to lie on the surfaces of a triangulated mesh of the data. The Laplacian 
AT-modes described later uses a different approach to find modes that are valid patterns with data lying on 
nonconvex manifolds. 

Graph data The mean-shift algorithm operates on data where each data point x„ is defined by a feature 
vector in In some applications, the data is best represented as a weighted graph, where each vertex is a 
data point x„, and an edge (x„,Xm) represents a neighborhood relation between a pair of data points, with 
a real-valued weight 5(x„, x^) representing distance. This allows one to work with data that need not live in 
a Euclidean space, and can be used to represent distances along a manifold (for example, by approximating 
geodesics through shortest paths over the graph). 

Some algorithms have been proposed for graph data that are based on a KDE. They work by assigning to 
each data point a “parent” in the graph, which is another data point in their neighborhood having a higher 
density (roughly speaking). This results in a forest of directed trees that spans the graph, whose roots are 
“modes” in the sense of not having any neighboring point with higher density. Each tree corresponds to a 
cluster. The parent of a data point is defined as the data point that optimizes a criterion based on the KDE 
and the distances between data points. The crucial aspect is that this optimization is defined only over the 
N data points rather than over the space (as happens with /c-medoid algorithms). 

One of the earliest approaches is the algorithm of [51]. This uses as criterion (p(x) — p(x„))/5(x, x„), 
which is a numerical approximation to the directional gradient of the KDE. This is maximized over the data 
points X G {xi, ..., xtv} that are in a neighborhood of x„ (e.g. the k nearest neighbors or the points within 
a distance e). The output of this algorithm can be improved with cluster-merging based on topological 
persistence [23]. 

The medoidshift algorithm of [75] is more directly related to mean-shift. It uses as criterion the Rieman¬ 
nian center of mass (sometimes called Frechet mean [1, 64]) f(x) = argminxg{xi,...,XN} ^n), 

where Wn =p(x*^’’\x„) oc Ar(||(x('”) — x„)/CTp) is evaluated at the current iterate x*^”). The algorithm then 
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alternates computing the Riemannian center of mass (over the data points) with updating the weights. We 
recover mean-shift by using the squared Euclidean distance for 6 and optimizing over . Unlike mean-shift 
(which finds maxima of the KDE), this does not seem to maximize a global objective function, but rather 
maximizes a local objective at each iteration and data point. The algorithm gives better clusterings and is 
faster in its “blurring” version, where at every iteration the entire dataset is updated. In matrix form (using 
the random-walk matrix P and the matrix A = (S(xn,Xm))nm of pairwise distances, and updating all JV 
points synchronously), this takes the form X ^ argminxi,...,XAr (^P): where the minimization is column¬ 
wise. Hence, medoidshift can be seen as a discrete filter, while the BMS iteration X ^ XP is a continuous 
filter. As in the accelerated BMS, each iteration contracts the graph, reducing the number of points. It is 
possible to define other variations by using different types of distances S, such as the local Tukey median in 
the medianshift algorithm of [74]. 

Other extensions The mean-shift update as originally proposed [37] results from estimating the gradient 
of a density in two steps: first, one estimates the density with a KDE, and then one differentiates this. 
Sasaki et al. [71] directly estimate the gradient of the log-density by using score matching [49] and derive a 
mean-shift algorithm based on this (by equating the gradient to zero and obtaining a fixed-point iteration). 
In score matching, one does a least-squares fit to the true log-density gradient using a model. If the model 
is a linear combination of basis functions (one per data point), one obtains an update that is identical to 
the mean-shift update of eq. 3 but with weighted kernels (although, since these weights can be negative, 
this fixed-point iteration is not guaranteed to converge). A faster, parametric method is obtained by using 
fewer basis functions than data points. They observe better results than the original mean-shift with high¬ 
dimensional data. Note that, as seen in eq. (4), using an adaptive KDE (e.g. with bandwidths obtained using 
the entropic affinities) also gives a weighted mean-shift update, but the weights obey a different criterion. 

Ranking data consists of permutations of a given set of items, and are discrete. Meila and Bao [57] 
extended blurring mean-shift to ranking data by using the Kendall distance rather than the Euclidean 
distance, and by rounding the continuous average of eq. (3) to the nearest permutation after each iteration 
(so the algorithm stops in a finite number of steps). 

Eunctional data, such as a collection of curves or surfaces, live in an infinite-dimensional space. Mean- 
shift clustering can be extended to functional data, where it corresponds to a form of adaptive gradient 
ascent on an estimated surrogate density [26]. 

2.6 Extensions of mean-shift beyond clnstering: manifold denoising 

The mean-shift iteration (2) or (3) is essentially a local smoothing, which provides a smooth or denoised 
version of a data point as a weighted average of its nearby data points. Clustering is just one specific 
problem that can make use of this smoothing. Here we describe work based on mean-shift for manifold 
denoising. This was already pointed out in [37]. Consider data lying on a low-dimensional manifold in 
but with noise added (as in fig. 3). If we run one mean-shift iteration for each data point in parallel, and 
then replace each point with its denoised version, we obtain a denoised dataset, where points have moved 
towards the manifold. Repeating this eventually compresses the dataset into clusters, and this is the basis 
of the BMS algorithm, although of course it destroys the manifold structure. However, if we stop BMS very 
early, usually after one or two iterations, we obtain an algorithm that can remove noise from a dataset with 
manifold structure. 

The denoising ability of local smoothing was noted independently in the computer graphics literature 
for the problem of surface fairing or smoothing. Earlier, the Laplacian had also been used to smooth finite 
element meshes [46]. In surface smoothing, 3D point clouds representing the surface of an object can be 
recorded using LIDAR, but usually contain noise. This noise can be eliminated by Laplacian smoothing, 
which replaces the 3D location of each point with the average of its neighbors [79, 80]. The neighbors of 
each point are obtained from a triangulated graph of the cloud, which is usually available from the scanning 
pattern of LIDAR. Typically, the Laplacian is constructed without the notion of a KDE or a random-walk 
matrix, and both the values of its entries and the connectivity pattern are kept constant during iterations. 

One important problem of BMS (or Laplacian smoothing) is that points lying near the boundary of 
a manifold or surface will move both orthogonally to and tangentially along the manifold, away from the 
boundary and thus shrinking the manifold. While this is good for clustering, it is not for surface smoothing 
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Figure 3: Denoising a spiral with outliers over iterations (r = 0 is the original dataset). Each box is the 
square [—30, 30]^, where 100 outliers were uniformly added to an existing 1 000-point noisy spiral. Algorithms: 
MBMS and BMS, both with the same bandwidth a for the Gaussian kernel. MBMS denoises while preserving 
the spiral structure and ignoring the outliers. BMS locally collapses points onto clusters, destroying the spiral. 
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Figure 4: Sample pairs of (original,MBMS-denoised) images from the MNIST dataset. 


because it distorts the object shape, or for manifold learning because it distorts the manifold structure. For 
example, in the MNIST dataset of handwritten digit images [52], motion along the manifold changes the style 
of a digit (e.g. slant, thickness). Various approaches, such as rescaling to preserve the object volume, have 
been proposed in computer graphics [33]. In machine learning, the manifold blurring mean-shift (MBMS) 
algorithm [86] is an extension of BMS to preserve the local manifold structure. Rather than applying to each 
point the mean-shift vector directly, this vector is corrected to eliminate tangential motion. The mean-shift 
vector is first projected onto the local tangent space to the manifold at that point (estimated using local PCA 
on the nearest neighbors of the point), and this projection is removed from the motion. Hence, the motion 
is constrained to be locally orthogonal to the manifold. Comparing the local PCA eigenvalues between the 
orthogonal and tangent spaces gives a criterion to stop iterating. 

Fig. 4 shows MNIST images before and after denoising with MBMS. The digits look smoother (as if they 
had been anti-aliased to reduce pixelation) and easier to read (compare the original Ol2.3‘^£'i’?S9 
vs the denoised O I 1. 3 ^ S Co J 8 ^), and indeed a classifier trained on the denoised data performs better 
[86]. While this smoothing homogenizes the digits somewhat, it preserves distinctive style aspects of each 
digit. MBMS performs a sophisticated denoising (very different from simple averaging or filtering) by intel¬ 
ligently closing loops, removing or shortening spurious strokes, enlarging holes, removing speckle noise and, 
in general, subtly reshaping the digits while respecting their orientation, slant and thickness. MBMS has 
also been applied to the identification of the structure of tectonic faults from seismic data [41]. 

(Manifold) denoising can be used as a preprocessing stage for several tasks, leading to more robust models, 
such as dimensionality reduction [86], classification [44, 86], density estimation [42] or matrix completion 
[ 88 ]. 

In some algorithms, one can establish a continuum between clustering and dimensionality reduction, 
with clustering being the extreme case of dimensionality reduction where the manifold dimensionality is 
zero. This is the case with MBMS, where a tangent space of dimension zero recovers BMS. Another case is 
spectral clustering and spectral dimensionality reduction, e.g. normalized cut [76] vs Laplacian eigenmaps 
[5]: both operate by extracting eigenvectors from the graph Laplacian, but in spectral clustering the number 
of eigenvectors is the number of clusters, while in spectral dimensionality reduction one cluster is assumed 
and the eigenvectors correspond to degrees of freedom along the manifold defined by the cluster. 
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2.7 Accelerated mean-shift algorithms 

Computationally, mean-shift algorithms are slow, because their complexity is quadratic on the number of 
points (which would limit their applicability to small data sets), and many iterations may be required to 
converge (for MS). Significant work has focused on faster, approximate mean-shift algorithms, using a variety 
of techniques such as subsampling, discretization, search data structures [70], numerical optimization [61], 
and others, which we review below. Some of them are specialized to image segmentation while others apply 
to any data. A practical implementation of a fast mean-shift algorithm could combine several of these 
techniques. In addition, although it does not seem to have been studied in detail, mean-shift algorithms 
benefit from embarrasing parallelism, since the iterations for each point proceed independently from the rest 
of the points. 

Accelerating mean-shift (MS) With the Epanechnikov kernel, mean-shift converges in a finite number 
of iterations. If the neighborhood structure of the data is known a priori, as is the case in image segmentation 
[29], MS can be very efficient without any approximations, since the iteration for each point involves only 
its neighbors. 

With the Gaussian kernel and with arbitrary data, for which the neighborhood structure is not known and 
varies over iterations, the runtime of MS can be large, particularly with large datasets, and much work has 
tried to accelerate Gaussian MS. Computationally, MS has two bottlenecks [12]: (I) Accurately converging 
to a mode can require many iterations, since its convergence is typically linear and can approach sublinearity 
in regions where two modes are close. (2) Each iteration is linear on the number of points N, because it is 
an average of the data. Thus, running MS for the entire dataset is 0{IDN'^), where I is the average number 
of iterations per point and D the dimension. Eor example, in image segmentation, typical values are 10 to 
100 for / (depending on the accuracy sought and the bandwidth a), D < 5 (higher if using texture features) 
and thousands to millions for N (the number of pixels). Acceleration algorithms should attack either or 
both bottlenecks, while keeping a low approximation error, i.e., producing a clustering (assignment of points 
to modes found) that is close to that of the naive MS—otherwise one would really be running a different 
clustering algorithm. 

As for accelerating the convergence, the obvious answer is using Newton’s method (with a modified 
Hessian to ensure ascent), because it has quadratic convergence, computing the Hessian of a Gaussian 
mixture is simple [9] and Newton iterations are typically not much costlier than those of MS. The reason 
is that, for the low-dimensional problems for which MS is best suited, computing the Newton direction 
(which involves solving a D x D linear system) is not costly compared to computing the gradient or Hessian 
themselves. However, Newton’s method does have other problems with MS: (I) it introduces user parameters 
such as step sizes or damping coefficients; (2) it need not be effective far from a mode, where the Hessian is 
undefined; (3) since Newton’s method is very different from MS, it can make a point converge to a different 
mode than under MS. A solution to this is to use a MS-Newton method [12] that starts with MS iterations 
(which often suffice to move the iterate relatively close to a mode) and then switches to Newton’s method. 

As for reducing the cost of one iteration below 0{N), the most obvious strategy is to approximate the 
average in each iteration with a small subset of the data, namely the points closest to the current iterate. 
This can be done by simply ignoring faraway points in the average, which introduces an approximation 
error in the modes, or by updating faraway points infrequently as in the sparse EM algorithm of [12], which 
guarantees convergence to a true mode. Unfortunately, either way this requires finding nearest neighbors, 
which is itself 0{N) unless we know the neighborhood structure a priori. And, if the bandwidth is relatively 
large (which is the case if we want a few clusters), then a significant portion of the data have non-negligible 
weights in the average. Another approach is to use A-body methods such as the (improved) fast Gauss 
transform [40, 93] or various tree structures [85]. These can reduce the 0{N‘^) cost of one iteration for every 
data point to 0{N) or 0{N log N), respectively. However, this implies we run synchronous iterations (all 
points move at once), and does not scale to high dimensions (because the data structures involved grow 
exponentially with the dimension), although it is generally appropriate for the low dimensions used in image 
segmentation. 

A final approach consists of eliminating many of the iterations for many of the points altogether, in effect 
predicting what mode a given point will converge to. This cannot be simply done by running MS for a small 
subset of the points and then assigning the remaining points to the mode of their closest point in the subset. 
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because mean-shift clusters have complex shapes that significantly differ from a nearest-neighbor clustering, 
and the clustering error would be large. A very effective approach for image segmentation is the spatial 
discretization approximation of [12]. This is based on the fact that the trajectories followed by different 
points as they converge towards a given mode collect close together (see section 2.3 and fig. IB). Thus, we 
can run the naive MS for a subset of the points, keeping track of the trajectories they followed. For a new 
point, as soon as we detect that its iterate is close to an existing trajectory, we can stop and assign it to that 
trajectory’s mode. With image segmentation (even if the number of features, and thus dimensions, is large), 
the trajectories can be coded by discretizing the image plane into subpixel cells and marking each cell the 
first time an iterate visits it with the mode it converges to. Since nearly all these cells end up empty, the 
memory required is small. This reduces the average number of iterations per point I to nearly 1 with only 
a very small clustering error [12]. This error can be controlled through the subpixel size. 

Yuan et al. [94] show that, if y = f(x) is the result of applying a mean-shift step to a point x, then 
all the points z within a distance ||y — x|| from y have a KDE value p{z) > p(x). Not all these points z 
actually converge to the same mode as x, however many do. A fast mean-shift algorithm results from not 
running mean-shift iterations for all those points and assigning them to the same mode as x. Hence, as in the 
discretization approach above, only a few data points actually run mean-shift iterations. The approximation 
error was not controlled in [94], although one could use instead a distance r ||y — x|| and use r G [0,1] to 
control it. 

Paris and Durand [62] combine several techniques to accelerate mean-shift image segmentation, including 
spatial discretization and ID convolutions (classifying most pixels without iterating, as above), and construct 
a hierarchy of clusterings based on the notion of topological persistence from computational geometry [34]. 
The algorithm is fast enough to segment large images and video, although it is not clear how good an 
approximation it is to the true mean-shift. 

As the dimension of the data points increases, searching for nearest neighbors (to compute the mean 
shift update in eq. (2) or (3) with a truncated kernel) becomes a bottleneck. One can then use approximate 
nearest-neighbor algorithms. For example, Locality-Sensitive Hashing (LSH) [2] is used in [38]. 

Accelerating blurring mean-shift (BMS) A simple and very effective acceleration that has essentially 
zero approximation error was given in [13]. It essentially consists of interleaving connected-components 
and blurring steps. The fact that BMS collapses clusters to a single point suggests that as soon as one 
cluster collapses we could replace it with a single point with a weight porportional to the cluster’s number of 
points. This will be particularly effective if clusters collapse at different speeds, which happens if they have 
different sizes, as predicted in section 2.3; e.g. see fig. 6. The total number of iterations remains the same 
as for the original BMS but each iteration uses a dataset with fewer points and is thus faster. Specifically, 
the Gaussian kernel density estimate is now p(x) = where p(x|n) = A/'(x; x„, cr^I) and the 

posterior probability is p{n\xm) = p(xm|n)7r„/p(xm) as in [9]. At the beginning 7r„ = -^ Vn and when 
clusters m and n merge then the combined weight is + 7r„. Using the matrix notation of section 2.1 we 
have Wnm oc p{xm\n), n = diag (7r„), dm = Yln=i WnmT^n = pi'^m) and (nWD~i)„m = p(n|xm). This can 
be proven to be equivalent to the original BMS. 

The reduction step where coincident points are replaced with a single point can be approximated by a 
connected-components step where points closer than e are considered coincident, where e > 0 takes the same 
value as in the final connected-components step of BMS (fig. 2B). Thus, e is the resolution of the method 
(below which points are indistinguishable), and while BMS applies it only after having stopped iterating, 
the accelerated version applies it at each iteration. Hence, the accelerated BMS algorithm alternates a 
connected-components graph contraction with a BMS step. 

Experimentally with image segmentation, a remarkable result arises: the runtime of this accelerated BMS 
is almost constant over bandwidth values, unlike for both MS and BMS, corresponding to about 4 to 5 BMS 
iterations, and is dominated by the first few iterations, where almost no merging occurs. This means a 
speedup factor of 2 to 4 over BMS and 5 to 60 over MS [13]. 
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3 X-modes and Laplacian i^-modes algorithms 

As discussed earlier, equating modes with clusters as in mean-shift implies we lose direct control on the 
number of clusters and do not differentiate between meaningful and non-meaningful modes. Recently, a, K- 
modes algorithm [17] has been proposed that addresses these problems^. The (Gaussian) AT-modes algorithm 
takes both K and a as user parameters and maximizes the objective function 


max 

z,c 


1 

A 


K N 





s.t. ZIk = Iw, z G 


( 6 ) 


over the cluster centroids G and the assignments of points to clusters Znk G {0;1}) where G is the 
Gaussian kernel. For a given assignment Z, this can be seen as the sum of a KDE as in eq. (1) but separately 
for each cluster. Thus, a good clustering must move centroids to local modes, but also define K separate 
KDEs. This naturally combines the idea of clustering through binary assignment variables (as in iG-means 
or AT-medoids) with the idea that, for suitable bandwidth values, high-density points are representative of 
a cluster (as in mean-shift). As a function of the bandwidth cr, the AT-modes objective function becomes 
AT-means for ct —>■ oo and a version of AT-medoids for cr —>■ 0. The training algorithm alternates an assignment 
step as in A'-means, where each data point is assigned to its closest centroid, with a mode-finding step as 
in mean-shift, but only for each centroid in the KDE defined by its current cluster (rather than for each 
data point). A more robust, homotopy-based algorithm results from starting with cr = oo (i.e., AT-means) 
and gradually decreasing cr while optimizing the objective for each cr. The computational complexity of this 
algorithm is the same as for A'-means {0{DNK) per iteration), although it requires more iterations, and is 
thus much faster than mean-shift {0{DN‘^) per iteration). 

AT-modes obtains exactly AT modes even if the KDE of the data has more or fewer than AT modes, because 
it splits the data into K KDEs. Using the homotopy-based algorithm, it tends to track a major mode in 
each cluster and avoid outliers. Thus, A'-modes can work well even in high dimensions, unlike mean-shift. 

The fundamental disadvantage of AT-modes is that the clusters it defines are convex, as with AT-means, 
since they are the Voronoi cells defined by the K modes. This is solved in the Laplacian AT -modes algorithm 
[87], which minimizes the objective function 
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S.t. ZIk = 1n, Z > 0. 


(7) 


This relaxes the hard assignments of A'-modes to soft assignments (so each point may belong to different 
clusters in different proportions) and adds a term to the objective with a weight A > 0 that encourages neigh¬ 
boring points to have similar soft assignments. This term can be equivalently written as A tr (Z^LZ) in terms 
of the graph Laplacian L = D — W constructed from an affinity matrix W, where D = diag (X]m=i ’^nm) is 
the degree matrix. The training algorithm is as with A'-modes, but the assignment step becomes a convex 
quadratic program on Z, which can be solved efficiently with a variety of solvers. 

Laplacian AT-modes becomes the AT-modes algorithm with A = 0, and a Laplacian K-means algorithm 
with cr = oo. The introduction of the Laplacian term allows the clusters to be nonconvex, as happens with 
other Laplacian-based algorithms such as spectral clustering. However, here we solve a quadratic program 
rather than a spectral problem, and obtain a KDE, centroids (= modes) and soft assignments of points to 
clusters (unlike spectral clustering, which only returns hard assignments). 

An out-of-sample mapping to predict soft assignments z(x) for a test point x not in the original training 
set can be obtained by augmenting (7) with x and solving but keeping the centroids and the soft assignments 
for the training points fixed. The result equals the projection on the simplex of the average of two terms: 
the average of the assignments of x’s neighbors, and a term dependent on the distances of x to the centroids 
[87]. The assignment Zfc(x) can be readily interpreted as a posterior probability p(fc|x) of x belonging to 
cluster k. Hence, Laplacian AT-modes can be seen as incorporating nonparametric posterior probabilities into 
mean-shift (or spectral clustering). The usual way of obtaining posterior probabilities in clustering is by 
using a mixture model, for which an EM algorithm that maximizes the likelihood is derived. In contrast 

®There exists another algorithm called “iG-modes” [22, 48], but defined for categorical data, rather than continuous data. 
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Table 1: Comparison of properties of different clustering algorithms. 



A-means 

A-medoids 

Mean-shift 

Spectral 

clustering 

A-modes 

Laplacian 

A-modes 

Gaussian 
mixture, EM 

Centroids 

likely invalid 

“valid” 

“valid” 

N/A 

valid 

valid 

likely invalid 

Nonconvex 

clusters 

no 

depends 

yes 

yes 

no 

yes 

to some extent 

Density 

no 

no 

yes 

no 

yes 

yes 

yes 

Assignment 

hard 

hard 

hard 

hard 

hard 

soft, nonparam. 

soft, param. 


to this parametric approach, in Laplacian itT-modes the assignments optimize an objective function that 
is designed specifically for clustering, unlike the likelihood; the clusters are not obliged to follow a model 
(such as Gaussian); and, consequently, the optimization algorithm does not depend on the particular type of 
clusters (unlike the EM algorithm, which depends on the cluster model). These soft assignments or posterior 
probabilities are helpful to estimate the uncertainty in the clustering or for other uses. For example, in 
image segmentation they can be used as a smooth pixel mask for image matting, conditional random fields 
or saliency maps. 

Cluster representatives: means, modes and medoids In centroid-based clustering algorithms, each 
cluster is associated with a centroid: the cluster mean in itT-means, an exemplar (data point) in iti-medoids, 
and the KDE mode in mean-shift, itT-modes and Laplacian iL-modes. A desirable property of centroids, 
which makes them interpretable, is that they should be valid patterns and be representative of their cluster 
(“look” like a typical pattern in the cluster). A well-known disadvantage of iL-means is that the mean 
of a nonconvex cluster need not be a valid pattern, since it may lie in a low-density area. This is often 
the case with clusters with manifold structure. For example, given a sequence of rotated digit-1 images 
//1 w w its mean • is not a valid digit-1 image itself. In A'-medoids, exemplars (data points) 
are valid patterns by definition, although when the data is noisy, individual exemplars may look somewhat 
atypical (see the manifold denoising discussion in section 2.6). Modes can strike an optimal tradeoff, in being 
typical, valid and yet denoised representatives, as follows. First, modes are by definition on high-density 
areas, so in this sense they are representative of the KDE and cluster. Second, a mode can be characterized 
as a scale-dependent, local average of the data. Indeed, a mode x* is a maximum of the density p, so 
its gradient at x* is zero, and this implies that x* equals the weighted average of the data in the sense 
of eq. (2) and (3) (where x* = f(x*)), as seen in section 2. As a function of the scale, a mode spans a 
continuum between equaling the regular mean of the whole data (a = oo) and equaling any individual data 
point (cr = 0). With an intermediate bandwidth in the KDE, they are local averages of the data and so 
can remove noise that affects each individual exemplar, thus being even more prototypical than actual data 
points. In this sense, (Laplacian) AT-modes achieves a form of intelligent denoising similar to that of MBMS 
(section 2.6). 

Table 1 compares several clustering algorithms in terms of whether their centroids are valid patterns, 
whether they can model nonconvex clusters, whether they estimate a density, and whether the cluster 
assignments are hard or soft. 

4 Examples, applications, software 

We illustrate MS and BMS in the clustering application where they have been most effective (image seg¬ 
mentation), and point out clustering applications where they are not so effective (high-dimensional data or 
data with manifold structure), as well as applications beyond clustering where MS and BMS have been used. 
Matlab code for most of the algorithms described may be obtained from the author. 

Image segmentation MS and BMS are most effective with low-dimensional data, and its most successful 
application is in image segmentation [29]. They are applied as follows. Given an image, we consider each 
pixel as a data point in a space consisting of two types of features: spatial features (the i and j location of 
a pixel) and range features (the intensity I or grayscale value, the color values, texture features, etc. of the 
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A: original image B: MS segmentation 



C: datapoints & iterates 


D: data/iter. (2D proj.) 



Figure 5: A: test image (cameraman 50 x 50 grayscale). B: Gaussian MS segmentation with bandwidth 
CT = 8 pixels, resulting in 5 clusters (shown in different colors), with the respective modes marked *. C: 
the dataset (pixels) and all the MS iterates for all starting points in space, colored by cluster. D: 

projection of the middle plot on (i, j) space. The paths for two starting pixels are shown in plots B-D. 



Figure 6: Sequence of datasets ..., obtained by Gaussian BMS for the CcunerEmicLn image of fig. 5 
with bandwidth a — 6, resulting in 7 clusters. We show the 2D projection on the spatial domain of every 
data point (pixel), colored by cluster, as in fig. 5D. Note: (1) points very quickly move towards a centroid 
and collapse into it (clustering property); (2) for each cluster-to-be, the local direction of maximum variance 
collapses much more slowly than the lower-variance directions, producing linearly shaped clusters (denoising 
property) that straighten and shorten. The BMS stopping criterion stopped BMS at iteration r = 11, where 
the dataset consists of 7 clusters of coincident points. If not stopping, these clusters would keep moving and 
eventually merge into a single cluster. 


pixel). The range features are scaled to span approximately the range of the spatial features. This way, all 
features and the bandwidth have pixel units. For example, for the image of fig. 5, we rescale the original 
intensity values to the range [0,100], so a feature vector (4,13,80) would correspond to the pixel located 
at coordinates (4,13), which has an intensity equal to 80% of the maximum intensity (white). The precise 
scaling will affect the clustering and should be done carefully. Using spatial features is beneficial because 
they introduce spatial coherence (nearby pixels tend to belong to the same cluster), although sometimes only 
the range features are used. One should use a perceptually uniform color space such as LAB rather than the 
RGB space, so that Euclidean distances approximately match perceptual differences in color [36]. 

Fig. 5 shows an example with a grayscale image of 50 x 50 pixels. Thus, the dataset contains N = 2 500 
points in 3D (location i,j and intensity I). Fig. 5 shows the result with MS while fig. 6 shows the result with 


19 




















































































Figure 7: Left: clustering of 5 spirals using Laplacian iF-modes (K = 5, cr = 0.2, the circle at the top right 
corner has a radius of cr), with modes denoted by ►, and contours of the KDE of the “red” cluster. Right: 
a subset of the 10 centroids obtained by iF-means, Laplacian LC-modes and mean shift for MNIST data. 


BMS. As noted earlier, MS and BMS generally give similar clusterings over a range of bandwidths. 

With the Gaussian kernel, reasonable clusterings arise for bandwidths around ^ of the image size (vertical 
or horizontal), although one should explore a range of bandwidths. In some computer vision applications, 
MS is sometimes used with a very small bandwidth in order to oversegment an image into uniform patches 
which are then further processed. 

Cases when mean-shift clustering does not work well Fig. 7 illustrates two situations where MS 
(and BMS) do not produce good clusterings. The first one is data having manifold structure (of lower 
dimension than the feature space), as in the ID spirals in a 2D space shown in the left panel. The KDE 
will have either many modes spread over the data manifolds (for smaller bandwidths) or few modes that 
mix manifolds (for larger bandwidths). No bandwidth value will cluster the spirals correctly. The Laplacian 
AT-modes algorithm is able to cluster the spirals correctly, while estimating a reasonable KDE and mode for 
each spiral [87]. 

The second situation is high-dimensional data, as in the MNIST handwritten digits [52], where each data 
point is a grayscale image of 28 x 28 pixels (i.e., D = 784 feature dimensions). In high-dimensional spaces, 
MS tends to produce either a single mode (for larger bandwidths), which is uninformative, or many modes 
(for smaller bandwidths), which often sit on small groups of points or outliers. In the results in fig. 7 (right 
panel), the MS bandwidth was tuned to produce K = 10 clusters, and AT-means and Laplacian AT-modes 
were run with AT = 10. Since the data forms nonconvex clusters, A'-means produces centroids that average 
distant data points, of possibly different digit classes, and thus do not look like digits. Again, Laplacian 
AT-modes can partition the data into exactly K clusters, while finding a mode for each that looks like a valid 
digit and is more representative of its class [87]. 

Multivalued regression and inversion In (univalued) regression, we assume there exists a mapping 
f: X S —>■ y S which assigns a unique vector y to each possible input x (i.e., f is a function in the 
mathematical sense). We estimate f given a training set of pairs {(x„, y„)}))Lj. This is the usual regression 
setting. In multivalued regression, the mapping f assigns possibly multiple vectors yi,y 2 ... to each input 
X. A classical example is learning the inverse of a (univalued) mapping g. For example, the inverse of 
5 ( 2 /) = is fix) = ±\/xi which has 0, 1 or 2 outputs depending on the value of x (smaller, equal or 
greater than zero, respectively). Carreira-Perpihan [8, 10, 11] proposed to represent a multivalued mapping 
f: X G R'^ — >■ y G R^ by the modes of the conditional distribution p(y|x). Hence, in this case, we apply 
mean-shift to a conditional distribution p(y|x). One advantage of this approach is that the number of inverses 
is selected automatically for each value of x; essentially, it is the number of “clusters” found by mean-shift in 
p(y|x). Once the modes have been found, one can also estimate error bars for them from the local Hessian at 
each mode [8, 10]. The conditional distribution may be obtained from a Gaussian KDE or Gaussian mixture 
p(x, y) for the joint distribution of x and y (since the marginal and conditional distributions of any Gaussian 
mixture are also Gaussian mixtures), but it can also be learned directly, for example using mixture density 


20 










5 




Figure 8: Left panels: illustration for a 2D robot arm of using conditional modes to represent multiple 
inverses. The plots correspond to the joints’ angle space 6 = (0i,02) and the end-effector workspace x = 
(xi,X 2 ) where we want to reach the point near (xi,X 2 ) = (0.8, 0.5) (which is part of the desired workspace 
trajectory in blue). That point x can be reached by two joint angle configurations, elbow up/down, which 
correspond to the two modes of the conditional distribution p(0|x) (whose contours are shown). Note that 
the mean in 0-space (magenta dot) does not correspond to a valid inverse. Right panel: profile view of the 
palate (black outline) and several tongue shapes corresponding to the modes (red) and mean (magenta) of 
the conditional distribution for a speech sound /i/ in the utterance “rag”. 


networks or particle filters [7]. 

Fig. 8 illustrates two examples. On the left panels, we learn the inverse kinematics mapping of a robot 
arm [65, 66]. The forward kinematics mapping x = g(0) is univalued and gives the position x in workspace 
of the arm’s end-effector given the angles 9 at the joints. The inverse mapping 9 = f(x) is multivalued, 
as shown by the elbow-up and elbow-down configurations. In this case, the conditional distribution p(0|x) 
was learned using a particle filter. On the right panel, we learn to map speech sounds to tongue shapes 
(articulatory inversion) [67]. The American English /j/ sound (as in “rag” or “roll”) can be produced by one 
of two tongue shapes: bunched or retroflex. The figure shows, for a specific /i/ sound, 3 modes in tongue 
shape space (two bunched and one retroflex). In both inverse kinematics and articulatory inversion, using 
the conditional mean (shown in magenta), or equivalently doing univalued least-squares regression, leads to 
invalid inverses. 

Other applications Algorithms based on mean-shift have also been applied to video segmentation [32, 
62, 84], image denoising and discontinuity preserving smoothing [29], and to object tracking [27, 30], among 
other problems, as well as for manifold and surface denoising (as described in section 2.6). 

5 Conclusion and open problems 

Mean-shift algorithms are based on the general idea that locally averaging data results in moving to higher- 
density, and therefore more typical, regions. Iterating this can be done in two distinct ways, depending on 
whether the dataset itself is updated: mode finding (MS), or smoothing (BMS), both of which can be used 
for clustering, but also for manifold denoising, multivalued regression and other tasks. 

In practice, mean-shift algorithms are very attractive because—being based on nonparametric kernel 
density estimates—they make few assumptions about the data and can model nonconvex clusters. The 
user need only specify the desired scale of the clustering but not the number of clusters itself. Although 
computationally slow, mean-shift algorithms can be accelerated and scale to large datasets. They are very 
popular with low-dimensional data having nonconvex clusters, such as image segmentation. The (Laplacian) 
AT-modes algorithms remain nonparametric but still perform well with high-dimensional data and outliers. 

Mean-shift is intimately related to kernel density estimates and Gaussian mixtures, and to neighborhood 
graphs. The graph Laplacian or random-walk matrix P arises as a fundamental object in clustering and 
dimensionality reduction that encapsulates geometric structure about a dataset. Although not usually seen 
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this way, MS/BMS algorithms are essentially based on P, and they alternate between smoothing the data 
with P (a power iteration), with possibly modified eigenvalues, and updating P itself. 

Some directions for future research are as follows. (1) Understanding the geometry of Gaussian mixtures 
and its modes in high dimensions. (2) Finding a meaningful extension of (blurring) mean-shift to directed 
graphs. (3) Designing, or learning, random-walk or Laplacian matrices that are optimal for clustering in 
some sense. (4) Finding a meaningful definition of clusters that is based on “bumps” of a kernel density 
estimate, i.e., distinct regions of high probability rather than modes. 
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A Connected-components algorithm 

Consider an undirected graph G = (U, E) with vertex set V and edge set E. A connected component of G is 
a maximal subset of V such that every pair of vertices in it can be connected through a path using edges in 
E [31]. Hence, the vertex set V can be partitioned into connected components. The connected component 
of a given vertex v G V can be found by depth-first search (DFS), which recursively follows edges adjacent 
to V until all vertices reachable from v have been found. This can be repeated for the remaining vertices 
until all connected components have been found, with a total runtime G(|U| -I- |i?|). 

Connected-components is a clustering algorithm in its own right. To use it, one provides a matrix of 
distances dnm between every pair of vertices (data points x„ and Xm) and a threshold e > 0, and defines a 
graph having as vertices the N data points and as edges (x„, x^) if dnm < e for n,m = 1,N (this is 
sometimes called an e-ball graph). The connected components of this graph give the clusters. 

Connected-components gives poor clustering results unless the clusters are tight and clearly separated. 
However, it can be reliably used as a postprocessing step with mean-shift algorithms, as described in the 
main text, to merge points that ideally would be identical. For example, with MS, the final iterates for points 
that in the limit would converge to the same mode are numerically different from each other (by a small 
amount if using a sufficiently accurate stopping criterion in the mean-shift iteration). Hence, these iterates 
form a tight cluster around their mode, which is widely separated from the tight clusters corresponding to 
other modes. 

Naively implemented (fig. 9 left), connected-components would run in 0{DN'^) time, because to construct 
the graph we have to threshold all pairwise distances, each of which costs 0{D). However, in the case of 
tight, clearly separated clusters, we need not construct the graph explicitly, and it runs in 0{DNK) if there 
are K connected components. Let the data satisfy the following “tight clusters” assumption: there exists 
e > 0 that is larger than the diameter of each component (the largest distance between every pair of points 
in a component) but smaller than the distance between any two components (the smallest distance between 
two points belonging to different components). Then the connected components can be found incrementally 
by connecting each point to the representative of its component (fig. 9 right), where the representative of 
component fc is c^, which is a point in that component. This can be done on the fly, as we process each data 
point in MS, or given the final set of iterates in BMS. 

Some heuristics can further accelerate the computation. 1) The distances can be computed incrementally 
(dimension by dimension), so that as soon as e is exceeded we exit the distance calculation. In the tight 
cluster assumption, this could reduce the cost to 0{N{K + D)), since for distances above e each individual 
distance dimension will typically exceed e. 2) In image segmentation, we can scan the pixels in raster order, 
so that for most pixels their component is the same as for the previous pixel (this will not be true when 
crossing a cluster boundary, but that happens infrequently). Then, when computing the distance, we always 
try first the last pixels’s component. The average cost for the entire connected-components runtime is then 
0{ND), because most pixels compute a single distance. 

The value of e depends on the problem but it can usually be chosen in a wide interval. For example, in 
image segmentation with features in pixel units, we can set e = 0.5, since we do not need subpixel accuracy 
to locate a mode, and modes must be at least one pixel apart to be meaningful. In BMS, we can safely set 
e to a smaller value (say, 0.01), since its cubic convergence rate produces extremely tight clusters quickly. 


22 



A. Naive connected-components algorithm B. Efficient connected-components algorithm 


Define an e-ball graph: 


K i — 1, Cl i — Xi first component 

for n = 2 to N 

• vertices xi,..., xjv 


for fc = 1 to AT 

• edges (x„,Xm) (f(x„,x„) < e. 


if <C 6 old component 

Vn, m = \,... ,N. 


assign x„ to component fc: break 
if x„ was not assigned 

Apply DFS to this graph. 


K i — K -\- 1, C/e component 


Figure 9: Pseudocode for connected-components, implemented naively in 0{DN‘^) (left) and under the “tight 
clusters” assumption in 0{DNK) (right), where K is the number of components. In all cases, the input is a 
dataset xi,..., xjv G a distance function d{-, •) applicable to any pair of points, and a threshold e > 0. 
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